Spatial Analysis and Statistical Modeling with R and spmodel

2024 Society for Freshwater Science Conference

Michael Dumelle

EPA (USA)

Ryan Hill

EPA (USA)

6/2/24

Welcome!

  1. Please visit https://usepa.github.io/spworkshop.sfs24/ to view the workshop’s accompanying workbook

  2. If necessary, install and load R packages by visiting “Setup” in the workbook’s “Welcome” section

  3. (Optional) Download the workshop slides (instructions in the workbook’s “Welcome” section)

  4. Follow along and have fun!

Who Are We?

Michael Dumelle (he/him/his) is a statistician for the United States Environmental Protection Agency (USEPA). He works primarily on facilitating the survey design and analysis of USEPA’s National Aquatic Resource Surveys (NARS), which characterize the condition of waters across the United States. His primary research interests are in spatial statistics, survey design, environmental and ecological applications, and software development.

Who Are We?

Ryan Hill (he/him/his) is an aquatic ecologist with the U.S. EPA Office of Research and Development. He is interested in how watershed conditions drive differences in freshwater diversity and water quality across the United States. He has worked extensively with federal physical, chemical, and biological datasets to gain insights into the factors affecting water quality and biotic condition of freshwaters across the conterminous US. He has also worked to develop and distribute large datasets of geospatial watershed metrics of streams and lakes for the Agency (EPA’s StreamCat and LakeCat datasets).

Who Are We?

Lara Jansen (she/her/hers) is an aquatic community ecologist and an ORISE postdoctoral fellow working on predictive models of benthic macroinvertebrate communities across the conterminous US in relation to watershed factors. Lara completed her PhD in Environmental Science at Portland State University in 2023, studying the drivers and dynamics of harmful algal blooms in mountain lakes with Dr. Angela Strecker. She obtained an MS in Natural Resource Sciences at Cal Poly Humboldt with a thesis focused on the downstream impacts of dam flow regulation on benthic macroinvertebrate and algal communities.

Who Are We?

Wade Boys (he/him/his) is a graduate student at the University of Arkansas broadly interested in understanding how aquatic ectotherms will respond to climate change, especially the role of phenotypic plasticity in adapting to a warming world. Wade is a firm believer that science is not finished until it is communicated. In addition to research, he finds great purpose in cultivating community and connecting science to our everyday experiences as humans.

Disclaimer

The views expressed in this workshop are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government or the U.S. Environmental Protection Agency. The U.S. Environmental Protection Agency does not endorse any commercial products, services, or enterprises.

What is spmodel?

spmodel is an R package to fit, summarize, and predict for a variety of spatial statistical models. Some of the things that spmodel can do include:

  • Fit spatial linear and generalized linear models for point-referenced and areal (lattice) data
  • Compare model fits and inspect model diagnostics
  • Predict at unobserved spatial locations (i.e., Kriging)
  • And much more!

Why use spmodel?

There are many great spatial modeling packages in R. A few reasons to use spmodel for spatial analysis are that:

  • spmodel syntax is similar to base R syntax for functions like lm(), glm(), summary(), and predict(), making the transition from fitting non-spatial models to spatial models relatively seamless.
  • There are a wide variety of spmodel capabilities that give the user significant control over the specific spatial model being fit.
  • spmodel is compatible with other modern R packages like broom and sf.

Spatial Linear Models in spmodel

Goals

  1. Construct and describe the spatial linear model
  2. Fit a spatial linear model using spmodel
  3. Make spatial predictions at unobserved locations (i.e., Kriging).

Construct and Describe the Spatial Linear Model

  1. Review the nonspatial linear model with independent random errors.
  2. Explain how the spatial linear model differs from the linear model with independent random errors.
  3. Explain how modeling for point-referenced data with distance-based model covariances differs from modeling for areal data with neighborhood-based model covariances.

Nonspatial Linear Models

  • Nonspatial linear models are incredibly flexible tools that quantify the relationship between a response variable and a set of explanatory variables

  • Examples of nonspatial linear models: multiple linear regression, analysis of variance (ANOVA), splines, polynomial regression, additive models, and mixed effect models

\[ \text{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon \]

Nonspatial Linear Models

Generalizing to \(n\) observations, the model becomes

\[ \begin{split} \text{y}_j & = \beta_0 + \beta_1 x_{1, j} + \beta_2 x_{2, j} + \dots + \beta_k x_{k, j} + \epsilon_j \\ i & = 1, 2, \dots, k \\ j & = 1, 2, \dots, n \end{split} \]

Nonspatial Linear Models

Matrix notation:

\[ \mathbf{y} = \begin{bmatrix} \text{y}_1 \\ \text{y}_2 \\ \vdots \\ \text{y}_n \\ \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{1, 1} & x_{2, 1} & \dots & x_{k, 1} \\ 1 & x_{1, 2} & x_{2, 2} & \dots & x_{k, 2} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1, n} & x_{2, n} & \dots & x_{k, n} \\ \end{bmatrix}, \]

Nonspatial Linear Models

Matrix notation:

\[ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \text{ and } \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}. \] The nonspatial linear model is then written compactly as \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
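As a quick illustration (simulated data with hypothetical variable names), the lm() function solves exactly the least-squares problem implied by this matrix form:

```r
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)                     # nonspatial linear model
X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # least-squares solution (X'X)^{-1} X'y

all.equal(unname(coef(fit)), c(beta_hat))  # TRUE: lm() matches the matrix formula
```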

Your Turn

Introduce yourself to one (or more) neighbor(s)! What kinds of data have you used statistical modeling for? Have you ever used linear models via lm() in R?

03:00

Spatial Linear Models

  • Ignoring spatial dependence and fitting nonspatial linear models to spatially dependent data generally leads to invalid fixed effect inference and poor predictions
  • Spatial linear models leverage spatial dependence to improve model fit
  • These models have historically been challenging to fit, both theoretically and computationally
  • Fortunately, spmodel makes it straightforward to implement them

Spatial Linear Models

  • Add a spatially dependent random error (\(\boldsymbol{\tau}\)) that accounts for dependence (i.e., covariance, correlation) in space

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \]

  • Can choose various ways to describe the covariance in \(\boldsymbol{\tau}\) (exponential, Gaussian, autoregressive, etc.)
  • Applies to point-referenced (e.g., locations in a field) and areal data (e.g., counties in a state)
  • splm() and spautor()

Fit a Spatial Linear Model Using spmodel

  1. Explore the splm() function using the moss data.
  2. Connect parameter estimates in the summary output of splm() to the spatial linear model.

The moss Data

The moss data contains a variable for log Zinc concentration for moss samples collected near a mining road in Alaska.

ggplot(moss, aes(color = log_Zn)) +
  geom_sf(size = 2) +
  scale_color_viridis_c() +
  scale_x_continuous(breaks = seq(-163, -164, length.out = 2)) +
  theme_gray(base_size = 18)

The moss Data

Figure 1: Distribution of moss data.

The moss Data

sideroad  log_dist2road  log_Zn  geometry
N         2.68           7.33    POINT (-413585.3 1997623)
N         2.68           7.38    POINT (-413585.3 1997623)
N         2.54           7.58    POINT (-415367.2 1996769)

The splm() function

The splm() function shares syntactic structure with the lm() function and generally requires at least three arguments

  1. formula: a formula that describes the relationship between the response variable (\(\mathbf{y}\)) and explanatory variables (\(\mathbf{X}\))
  2. data: a data.frame or sf object that contains the response variable, explanatory variables, and spatial information.
  3. spcov_type: the spatial covariance type ("exponential", "gaussian", "spherical", etc.; 17 total types)

The splm() function

spmod <- splm(formula = log_Zn ~ log_dist2road, data = moss, 
              spcov_type = "exponential")
summary(spmod)

Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6801 -1.3606 -0.8103 -0.2485  1.1298 

Coefficients (fixed):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    9.76825    0.25216   38.74   <2e-16 ***
log_dist2road -0.56287    0.02013  -27.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared: 0.683

Coefficients (exponential spatial covariance):
       de        ie     range 
3.595e-01 7.897e-02 8.237e+03 

Separating Sources of Variation

  • Pseudo R-squared: The proportion of variability explained by the fixed effects
  • de: The proportion of variability attributable to spatial random error
  • ie: The proportion of variability attributable to independent random error
varcomp(spmod)
# A tibble: 3 × 2
  varcomp            proportion
  <chr>                   <dbl>
1 Covariates (PR-sq)     0.683 
2 de                     0.260 
3 ie                     0.0571

Tidying Output

The tidy() function tidies fixed effect model output into a convenient tibble (a special data.frame)

tidy(spmod)
# A tibble: 2 × 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)      9.77     0.252       38.7       0
2 log_dist2road   -0.563    0.0201     -28.0       0
  • When log_dist2road is zero, the average log_Zn value is 9.768
  • A one unit increase in log_dist2road is associated with an average decrease of 0.563 in log_Zn
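Because both variables are natural-log transformed, the slope can also be read multiplicatively: each one-unit increase in log_dist2road multiplies the typical Zn concentration by exp(-0.563). A quick base-R check (slope copied from the tidy() output above):

```r
b1 <- -0.563  # estimated slope for log_dist2road from tidy(spmod)
exp(b1)       # multiplicative effect on Zn: roughly 0.57, i.e. about a 43% decrease
```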

Tidying Output

It can also be used to tidy spatial covariance parameters

tidy(spmod, effects = "spcov")
# A tibble: 3 × 3
  term   estimate is_known
  <chr>     <dbl> <lgl>   
1 de       0.360  FALSE   
2 ie       0.0790 FALSE   
3 range 8237.     FALSE   
  • range: A parameter that controls the distance-decay rate of the spatial covariance
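As a sketch of what range controls, the fitted exponential covariance can be evaluated directly in base R. Parameter values are copied from the tidy() output above, and the exponential form de * exp(-h / range), with ie added only at distance zero, is assumed:

```r
de <- 0.360; ie <- 0.0790; range_ <- 8237  # estimates from the fitted model
# (range_ avoids masking base::range)
exp_cov <- function(h) ifelse(h == 0, de + ie, de * exp(-h / range_))

exp_cov(0)       # total variance at distance zero (de + ie)
exp_cov(range_)  # at h = range, the spatial covariance has decayed by a factor of exp(-1)
exp_cov(50000)   # nearly zero at large distances
```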

Model Fit

The glance() function returns columns with the sample size (n), number of fixed effects (p), number of estimated covariance parameters (npar), optimization criteria minimum (value), AIC (AIC), AICc (AICc), log-likelihood (loglik), deviance (deviance), and pseudo R-squared (pseudo.r.squared)

glance(spmod)
# A tibble: 1 × 9
      n     p  npar value   AIC  AICc logLik deviance pseudo.r.squared
  <int> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1   365     2     3  367.  373.  373.  -184.     363.            0.683

Model Diagnostics

The augment() function returns columns with

  • .fitted, the fitted value, calculated from the estimated fixed effects in the model
  • .resid, the raw residual (observed minus fitted)
  • .hat, the Mahalanobis distance, a metric of leverage
  • .cooksd, the Cook’s distance, a metric of influence
  • .std.resid, the standardized residual
augment(spmod)

Model Diagnostics

Simple feature collection with 365 features and 7 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -445884.1 ymin: 1929616 xmax: -383656.8 ymax: 2061414
Projected CRS: NAD83 / Alaska Albers
# A tibble: 365 × 8
   log_Zn log_dist2road .fitted .resid   .hat .cooksd .std.resid
 *  <dbl>         <dbl>   <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
 1   7.33          2.68    8.26 -0.928 0.0200 0.0142       1.18 
 2   7.38          2.68    8.26 -0.880 0.0200 0.0186       1.35 
 3   7.58          2.54    8.34 -0.755 0.0225 0.00482      0.647
 4   7.63          2.97    8.09 -0.464 0.0197 0.0305       1.74 
 5   7.26          2.72    8.24 -0.977 0.0215 0.131        3.45 
 6   7.65          2.76    8.21 -0.568 0.0284 0.0521       1.89 
 7   7.59          2.30    8.47 -0.886 0.0300 0.0591       1.96 
 8   7.16          2.78    8.20 -1.05  0.0335 0.00334      0.439
 9   7.19          2.93    8.12 -0.926 0.0378 0.0309       1.26 
10   8.07          2.79    8.20 -0.123 0.0314 0.00847      0.723
# ℹ 355 more rows
# ℹ 1 more variable: geometry <POINT [m]>

Your Turn

Run the fitted(), residuals(), hatvalues(), cooks.distance(), and rstandard() functions and verify they return the same values as augment().

05:00

Plotting Model Objects

The plot() function can be used on a fitted model object to construct a few pre-specified plots of these model diagnostics. For example, the following code plots the Cook’s distance, a measure of influence, which quantifies each observation’s impact on model fit:

plot(spmod, which = 4)

Powerful to combine diagnostics with spatial visualizations via augment()

Plotting Model Objects

Your Turn

Use spmodel’s plot function on the spmod object to construct a plot of the fitted spatial covariance vs spatial distance. To learn more about the options for spmodel’s plot function, run ?plot.spmodel.

03:00

Model Comparison

  • Fit a nonspatial linear model using splm() and spcov_type = "none"
none <- splm(formula = log_Zn ~ log_dist2road, data = moss,
              spcov_type = "none")

Your Turn

Using summary(), compare the output of the none model fit via splm() to a model fit via lm(). Do you get the same estimates and standard errors?

03:00

AIC and AICc

  • glances() lets us compare the fit of several models simultaneously
  • The lower the AIC/AICc, the better the fit
glances(spmod, none)
# A tibble: 2 × 10
  model     n     p  npar value   AIC  AICc logLik deviance pseudo.r.squared
  <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1 spmod   365     2     3  367.  373.  373.  -184.     363.            0.683
2 none    365     2     1  634.  636.  636.  -317.     363.            0.671
  • The spatial model has a much lower AIC/AICc
  • We do not recommend comparing pseudo R-squared across models

REML vs ML

  • The default estimation method is restricted maximum likelihood (REML, estmethod = "reml")
  • To compare models having different fixed effect structures with AIC/AICc, the models must be fit using maximum likelihood (ML, estmethod = "ml")
  • Generally few differences between REML and ML unless sample sizes are small
    • When sample sizes are small, REML can notably outperform ML
  • Refit a “final” model using REML (more on this shortly)

Leave-One-Out Cross Validation

Leave-one-out cross validation (loocv):

  • Hold out observation one, fit the model to the remaining data, and predict observation one
  • Hold out observation two, fit the model to the remaining data, and predict observation two
  • Continue in this way through observation \(n\): hold out observation \(n\), fit the model to the remaining data, and predict observation \(n\)
  • Compare observed data to loocv predictions
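The recipe above can be sketched in a few lines of base R, using lm() on simulated data as a stand-in for a fitted spatial model (loocv() refits the spatial model in an analogous but more efficient way, and its sign convention for bias may differ):

```r
set.seed(2)
dat <- data.frame(x = rnorm(30))
dat$y <- 1 + 2 * dat$x + rnorm(30)

loo_pred <- vapply(seq_len(nrow(dat)), function(i) {
  fit <- lm(y ~ x, data = dat[-i, ])               # fit without observation i
  predict(fit, newdata = dat[i, , drop = FALSE])   # predict held-out observation i
}, numeric(1))

bias <- mean(loo_pred - dat$y)      # average prediction error (near zero if well-fitting)
MSPE <- mean((dat$y - loo_pred)^2)  # mean squared prediction error
```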

Leave-One-Out Cross Validation

loocv() returns several useful fit statistics:

  • bias: The average difference between the observed value and its leave-one-out prediction. This should be close to zero for well-fitting models.
  • MSPE: The mean squared prediction error between the observed value and its leave-one-out prediction.
  • RMSPE: The square root of MSPE.
  • cor2: The squared correlation between the observed value and its leave-one-out prediction. Can be viewed as a “predictive” R-squared (can be compared across models).

Leave-One-Out Cross Validation

loocv(spmod)
# A tibble: 1 × 4
     bias  MSPE RMSPE  cor2
    <dbl> <dbl> <dbl> <dbl>
1 0.00655 0.111 0.333 0.886
loocv(none)
# A tibble: 1 × 4
      bias  MSPE RMSPE  cor2
     <dbl> <dbl> <dbl> <dbl>
1 0.000644 0.324 0.569 0.667
  • The spatial model has a much lower (better) MSPE/RMSPE and higher (better) cor2.
  • Unlike AIC/AICc, loocv() can be used to compare two models fit using REML that have different fixed effect structures

Model Comparison Strategies

  • All models are wrong, but some are useful
  • Model selection is quite a challenging problem
  • Fit a single hypothesized model? Use an algorithm to find a model? Somewhere in between?
  • Combine expert knowledge of the system with empirical tools to aid decision making
  • Generally, fit the “final” model using REML

Spatial Prediction (i.e., Kriging)

  1. Predict the response value at an unobserved location for point-referenced data using the moose data.
  2. Quantify prediction uncertainty.
  3. Perform leave-one-out cross-validation.

The moose Data

  • We will practice with some new data in spmodel called moose (and moose_preds)
  • The moose data contains moose counts and moose presence for 218 spatial locations in Alaska.
ggplot(data = moose, aes(colour = count)) +
  geom_sf(size = 2) +
  scale_colour_viridis_c(limits = c(0, 40)) +
  theme_minimal(base_size = 18)

The moose Data

Figure 2: Distribution of moose data.

The moose Data

elev  strat  count  presence  geometry
469   L      0      0         POINT (293542.6 1541016)
362   L      0      0         POINT (298313.1 1533972)
173   M      0      0         POINT (281896.4 1532516)

The moose_preds Data

  • The moose_preds data contains spatial locations that were not surveyed for which predictions are desired
ggplot(data = moose_preds) +
  geom_sf(size = 2) +
  theme_minimal(base_size = 18)

The moose_preds Data

Figure 3: Locations in moose prediction data.

The moose_preds Data

elev      strat  geometry
143.4000  L      POINT (401239.6 1436192)
324.4375  L      POINT (352640.6 1490695)
158.2632  L      POINT (360954.9 1491590)

Fit a Spatial Linear Model

  • elev * strat is shorthand for elev + strat + elev:strat
moosemod <- splm(count ~ elev * strat, data = moose,
                  spcov_type = "spherical")
tidy(moosemod)
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   0.310    9.02       0.0344 0.973  
2 elev          0.0141   0.00806    1.76   0.0792 
3 stratM        6.93     2.26       3.07   0.00217
4 elev:stratM  -0.0273   0.0130    -2.10   0.0357 
  • A linear model for count data? (more on this later)
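The elev * strat shorthand can be verified with base R's model.matrix(), which expands the formula into main effects plus their interaction (toy data with hypothetical values):

```r
toy <- data.frame(elev = c(100, 200, 300), strat = factor(c("L", "M", "L")))

# elev * strat expands to intercept, elev, stratM, and elev:stratM
colnames(model.matrix(~ elev * strat, data = toy))
```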

Spatial Predictions For moose_preds

Using predict()

# results omitted
predict(moosemod, newdata = moose_preds)

Using augment()

moose_aug <- augment(moosemod, newdata = moose_preds)
moose_aug

Spatial Predictions For moose_preds

Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
    elev strat .fitted           geometry
 * <dbl> <chr>   <dbl>        <POINT [m]>
 1  143. L       3.45  (401239.6 1436192)
 2  324. L       1.59  (352640.6 1490695)
 3  158. L      -0.267 (360954.9 1491590)
 4  221. M       2.39  (291839.8 1466091)
 5  209. M       7.62  (310991.9 1441630)
 6  218. L      -1.02  (304473.8 1512103)
 7  127. L      -1.23  (339011.1 1459318)
 8  122. L      -1.43  (342827.3 1463452)
 9  191  L      -0.239 (284453.8 1502837)
10  105. L       0.657 (391343.9 1483791)
# ℹ 90 more rows

Spatial Predictions For moose_preds

We can construct a plot of the predictions with

ggplot(data = moose, aes(colour = count)) +
  geom_sf(alpha = 0.4) +
  geom_sf(data = moose_aug, aes(colour = .fitted)) +
  scale_colour_viridis_c(limits = c(0, 40)) +
  theme_minimal()

Spatial Predictions For moose_preds

Figure 4: Distribution of moose predictions.

Your Turn

Examine the help file ?augment.spmodel and create site-wise 99% prediction intervals for the unsampled locations found in moose_preds.

05:00

Additional Features

Applications to:

  • Big spatial data
  • Nonspatial random effects
  • Simulating spatial data
  • And much more!
  • See Chapter 2 of the workbook

Spatial Generalized Linear Models in spmodel

Goals

  1. Explain how modeling spatial covariance fits within the structure of a generalized linear model.
  2. Use the spglm() function in spmodel to fit generalized linear models for various model families (i.e., response distributions).

The Spatial Generalized Linear Model

The spatial generalized linear model can be written as

\[ g(\boldsymbol{\mu}) = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \]

  • \(g(\boldsymbol{\mu})\) is the link function that “links” a function of the mean of \(\mathbf{y}\) to \(\mathbf{X} \boldsymbol{\beta}\), \(\boldsymbol{\tau}\), and \(\boldsymbol{\epsilon}\)
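For example, with a log link (used for Poisson count models), g() is log(), and its inverse maps the linear predictor back to a strictly positive mean. A minimal base-R illustration:

```r
eta <- c(-1, 0, 2.5)  # linear predictor (X beta + tau + epsilon) on the link scale
mu  <- exp(eta)       # inverse log link: means are always positive
all.equal(log(mu), eta)  # applying the link g() recovers the linear predictor

# logit link (binomial and beta families): maps probabilities in (0, 1) to the real line
p <- plogis(eta)         # inverse logit
all.equal(qlogis(p), eta)
```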

Response Distributions

Response distributions and link functions available in spmodel
Distribution        Data Type   Link Function
------------------  ----------  -------------
Poisson             Count       Log
Negative Binomial   Count       Log
Binomial            Binary      Logit
Beta                Proportion  Logit
Gamma               Skewed      Log
Inverse Gaussian    Skewed      Log

The moose Data

elev   strat  count  presence  geometry
468.9  L      0      0         POINT (293542.6 1541016)
362.3  L      0      0         POINT (298313.1 1533972)
172.8  M      0      0         POINT (281896.4 1532516)

Fit a Spatial Generalized Linear Model

  • What about a count model for moose?
poismod <- spglm(count ~ elev * strat, data = moose,
                 family = poisson, spcov_type = "spherical")
  • The family argument can be binomial, beta, poisson, nbinomial, Gamma, or inverse.gaussian
  • Similarities between spglm() and glm()

Fit a Spatial Generalized Linear Model

summary(poismod)

Call:
spglm(formula = count ~ elev * strat, family = poisson, data = moose, 
    spcov_type = "spherical")

Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-1.4245 -0.7783 -0.3653  0.1531  0.5900 

Coefficients (fixed):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.230575   0.958201  -2.328 0.019919 *  
elev         0.007623   0.003129   2.437 0.014820 *  
stratM       2.752234   0.782853   3.516 0.000439 ***
elev:stratM -0.010248   0.004472  -2.292 0.021928 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared: 0.09573

Coefficients (spherical spatial covariance):
       de        ie     range 
    3.892     1.163 51204.657 

Coefficients (Dispersion for poisson family):
dispersion 
         1 

Your Turn

Fit a spatial negative binomial model to the moose data with count as the response, elev, strat, and their interaction as predictors, and the "gaussian" spatial covariance function. Compare their fits using glances() and loocv(). Which model is preferable based on AIC/AICc? What about leave-one-out MSPE?

05:00

Spatial Predictions for moose_preds

Using predict()

# results omitted
predict(poismod, newdata = moose_preds)

Using augment()

augment(poismod, newdata = moose_preds)

Spatial Predictions for moose_preds

Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
    elev strat .fitted           geometry
 * <dbl> <chr>   <dbl>        <POINT [m]>
 1  143. L      0.207  (401239.6 1436192)
 2  324. L     -0.0563 (352640.6 1490695)
 3  158. L     -1.24   (360954.9 1491590)
 4  221. M     -1.16   (291839.8 1466091)
 5  209. M      1.78   (310991.9 1441630)
 6  218. L     -1.84   (304473.8 1512103)
 7  127. L     -2.80   (339011.1 1459318)
 8  122. L     -2.45   (342827.3 1463452)
 9  191  L     -0.409  (284453.8 1502837)
10  105. L     -1.10   (391343.9 1483791)
# ℹ 90 more rows

Your Turn

Use spglm() to fit a spatial logistic regression model to the moose data using presence as the response variable and a Cauchy covariance function. Then, find the predicted probabilities that moose are present at the spatial locations in moose_preds (Hint: Use the type argument in predict() or augment()).

05:00

GIS in R

Maintaining all analyses within a single software (R) can greatly simplify your research workflow. In this section, we’ll cover the basics of doing GIS in R.

Goals and Motivation

  • Understand the main features and types of vector data.
  • Generate point data from a set of latitudes and longitudes, such as from field sites.
  • Read, write, query, and manipulate vector data using the sf package.

Points, lines, and polygons

Figure 5: Vector data. Image from: https://earthdatascience.org/courses/earth-analytics/spatial-data-r/intro-vector-data-r/

Points, lines, and polygons

We can represent these features in R without actually using GIS packages.

library(tidyverse)  # ggplot2 is loaded as part of the tidyverse

id <- 1:5
cities <- c('Ashland', 'Corvallis', 'Bend', 'Portland', 'Newport')
longitude <- c(-122.699, -123.275, -121.313, -122.670, -124.054)
latitude <- c(42.189, 44.57, 44.061, 45.523, 44.652)
population <- c(20062, 50297, 61362, 537557, 9603)

oregon_cities <- data.frame(id, cities, longitude, latitude, population)

Points, lines, and polygons

ggplot(
  data = oregon_cities, 
  aes(x = longitude, y = latitude, size = population, label = cities)
) +
  geom_point() +
  geom_text(hjust = 1, vjust = 1) +
  theme_bw()

Points, lines, and polygons

Figure 6: Oregon cities plotted from data frame.

Points, lines, and polygons

So, is this sufficient for working with spatial data in R and doing spatial analysis? What are we missing?

If you have worked with vector data before, you may know that these data also usually have:

  • A coordinate reference system
  • A bounding box or extent
  • Plot order
  • Additional data

Exploring the Simple Features (sf) package

  • The sf package provides simple features access for R
  • sf fits within the “tidy” approach to data of Hadley Wickham’s tidyverse
  • In short, much of what used to require an ArcGIS license can now be done in R with sf

Exploring the Simple Features (sf) package

library(sf)
ls("package:sf")
  [1] "%>%"                          "as_Spatial"                  
  [3] "dbDataType"                   "dbWriteTable"                
  [5] "gdal_addo"                    "gdal_create"                 
  [7] "gdal_crs"                     "gdal_extract"                
  [9] "gdal_inv_geotransform"        "gdal_metadata"               
 [11] "gdal_polygonize"              "gdal_rasterize"              
 [13] "gdal_read"                    "gdal_read_mdim"              
 [15] "gdal_subdatasets"             "gdal_utils"                  
 [17] "gdal_write"                   "gdal_write_mdim"             
 [19] "get_key_pos"                  "NA_agr_"                     
 [21] "NA_bbox_"                     "NA_crs_"                     
 [23] "NA_m_range_"                  "NA_z_range_"                 
 [25] "plot_sf"                      "rawToHex"                    
 [27] "read_sf"                      "sf.colors"                   
 [29] "sf_add_proj_units"            "sf_extSoftVersion"           
 [31] "sf_proj_info"                 "sf_proj_network"             
 [33] "sf_proj_pipelines"            "sf_proj_search_paths"        
 [35] "sf_project"                   "sf_use_s2"                   
 [37] "st_agr"                       "st_agr<-"                    
 [39] "st_area"                      "st_as_binary"                
 [41] "st_as_grob"                   "st_as_s2"                    
 [43] "st_as_sf"                     "st_as_sfc"                   
 [45] "st_as_text"                   "st_axis_order"               
 [47] "st_bbox"                      "st_bind_cols"                
 [49] "st_boundary"                  "st_break_antimeridian"       
 [51] "st_buffer"                    "st_can_transform"            
 [53] "st_cast"                      "st_centroid"                 
 [55] "st_collection_extract"        "st_combine"                  
 [57] "st_concave_hull"              "st_contains"                 
 [59] "st_contains_properly"         "st_convex_hull"              
 [61] "st_coordinates"               "st_covered_by"               
 [63] "st_covers"                    "st_crop"                     
 [65] "st_crosses"                   "st_crs"                      
 [67] "st_crs<-"                     "st_delete"                   
 [69] "st_difference"                "st_dimension"                
 [71] "st_disjoint"                  "st_distance"                 
 [73] "st_drivers"                   "st_drop_geometry"            
 [75] "st_equals"                    "st_equals_exact"             
 [77] "st_filter"                    "st_geometry"                 
 [79] "st_geometry_type"             "st_geometry<-"               
 [81] "st_geometrycollection"        "st_graticule"                
 [83] "st_inscribed_circle"          "st_interpolate_aw"           
 [85] "st_intersection"              "st_intersects"               
 [87] "st_is"                        "st_is_empty"                 
 [89] "st_is_longlat"                "st_is_simple"                
 [91] "st_is_valid"                  "st_is_within_distance"       
 [93] "st_jitter"                    "st_join"                     
 [95] "st_layers"                    "st_length"                   
 [97] "st_line_interpolate"          "st_line_merge"               
 [99] "st_line_project"              "st_line_sample"              
[101] "st_linestring"                "st_m_range"                  
[103] "st_make_grid"                 "st_make_valid"               
[105] "st_minimum_rotated_rectangle" "st_multilinestring"          
[107] "st_multipoint"                "st_multipolygon"             
[109] "st_nearest_feature"           "st_nearest_points"           
[111] "st_node"                      "st_normalize"                
[113] "st_overlaps"                  "st_perimeter"                
[115] "st_point"                     "st_point_on_surface"         
[117] "st_polygon"                   "st_polygonize"               
[119] "st_precision"                 "st_precision<-"              
[121] "st_read"                      "st_read_db"                  
[123] "st_relate"                    "st_reverse"                  
[125] "st_sample"                    "st_segmentize"               
[127] "st_set_agr"                   "st_set_crs"                  
[129] "st_set_geometry"              "st_set_precision"            
[131] "st_sf"                        "st_sfc"                      
[133] "st_shift_longitude"           "st_simplify"                 
[135] "st_snap"                      "st_sym_difference"           
[137] "st_touches"                   "st_transform"                
[139] "st_triangulate"               "st_triangulate_constrained"  
[141] "st_union"                     "st_viewport"                 
[143] "st_voronoi"                   "st_within"                   
[145] "st_wrap_dateline"             "st_write"                    
[147] "st_write_db"                  "st_z_range"                  
[149] "st_zm"                        "vec_cast.sfc"                
[151] "vec_ptype2.sfc"               "write_sf"                    

Exploring the Simple Features (sf) package

  • Convert the existing oregon_cities data frame to a simple feature by:
    • Supplying the longitude and latitude (x and y).
    • Defining the (geographic) coordinate reference system (crs).

Exploring the Simple Features (sf) package

oregon_cities <- oregon_cities %>%
  st_as_sf(coords = c('longitude', 'latitude'), crs = 4269)
print(oregon_cities)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -124.054 ymin: 42.189 xmax: -121.313 ymax: 45.523
Geodetic CRS:  NAD83
  id    cities population                geometry
1  1   Ashland      20062 POINT (-122.699 42.189)
2  2 Corvallis      50297  POINT (-123.275 44.57)
3  3      Bend      61362 POINT (-121.313 44.061)
4  4  Portland     537557  POINT (-122.67 45.523)
5  5   Newport       9603 POINT (-124.054 44.652)

Exploring the Simple Features (sf) package

The oregon_cities object has now changed from a standard data frame to an sf object, which includes the features required for a true spatial object:

  • Geometry type
  • Bounding box
  • Coordinate reference system

Latitude and longitude columns have been moved to a new column called "geometry" (sticky).
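The stickiness can be seen directly: selecting attribute columns keeps the geometry, while st_drop_geometry() returns an ordinary data frame. A minimal sketch using the oregon_cities object from above:

```r
library(sf)
library(dplyr)

# Selecting one attribute still returns an sf object:
# the geometry column is carried along ("sticky")
oregon_cities %>%
  select(population) %>%
  class()

# To recover a plain data frame, drop the geometry explicitly
oregon_cities %>%
  st_drop_geometry() %>%
  class()
```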

Coordinate Reference Systems

sf also has functionality to re-project and manipulate spatial objects.

Figure 7: Image from: https://nceas.github.io/oss-lessons/spatial-data-gis-law/1-mon-spatial-data-intro.html

Coordinate Reference Systems

oregon_cities is currently in degrees, but certain applications may require an equal area projection and length units, such as meters. See: epsg.org

With sf we can:

  • Check to see if the current CRS is equal to the Albers Equal-Area Conic Projection
  • Transform oregon_cities to CRS 5070

Coordinate Reference Systems

st_crs(oregon_cities) == st_crs(5070)
[1] FALSE
oregon_cities <- 
  oregon_cities %>% 
  st_transform(crs = 5070)

Coordinate Reference Systems

Now let’s plot the transformed data…

ggplot(data = oregon_cities) +
  geom_sf_text(aes(label = cities),
               hjust=0, vjust=1.5) +
  geom_sf(aes(size = population)) + 
  xlab('Longitude') +
  ylab('Latitude') +
  theme_bw()

Coordinate Reference Systems

Figure 8: Oregon cities plotted in Albers Equal Area Projection.

It all feels like R

  • There can be huge advantages to doing GIS tasks in R without going back and forth to other GIS software
  • If you are familiar with R, the leap to doing GIS here can be small
  • sf provides a large number of GIS functions, such as buffers, intersection, centroids, etc.

It all feels like R

Example 1: Add a 100-km buffer to cities

cities_buffer <- 
  oregon_cities %>% 
  st_buffer(100000)

Plot map of buffered cities…

ggplot(data = cities_buffer) +
  geom_sf(aes(fill = cities), alpha = 0.5) +
  geom_sf(data = st_centroid(oregon_cities)) +
  theme_bw()

It all feels like R

Buffered cities w/ overlapping buffers.

It all feels like R

Example 2: Split buffers into sub-units and calculate areas

cities_buffer <- cities_buffer %>% 
  st_intersection() %>%
  mutate(area = st_area(.) %>% 
           units::drop_units(),
         id = as.factor(1:nrow(.)))

Plot results:

ggplot(data = cities_buffer) +
  geom_sf(aes(fill = id), alpha = 0.5) +
  theme_bw()

It all feels like R

Buffered cities w/ overlapping buffers split.

Raster data

  • Another fundamental data type in GIS is the raster
  • Rasters are a way of displaying gridded data, where each member of the grid represents a landscape feature (e.g., elevation)
  • The terra package is now the best package for working with rasters

Raster data

Much like sf, terra has a large number of functions for working with raster data.

library(terra)
ls("package:terra")
  [1] "%in%"                  "activeCat"             "activeCat<-"          
  [4] "add_box"               "add_legend"            "add<-"                
  [7] "addCats"               "adjacent"              "aggregate"            
 [10] "align"                 "all.equal"             "allNA"                
 [13] "animate"               "app"                   "approximate"          
 [16] "area"                  "Arith"                 "as.array"             
 [19] "as.bool"               "as.contour"            "as.data.frame"        
 [22] "as.factor"             "as.int"                "as.lines"             
 [25] "as.list"               "as.matrix"             "as.points"            
 [28] "as.polygons"           "as.raster"             "atan_2"               
 [31] "atan2"                 "autocor"               "barplot"              
 [34] "blocks"                "boundaries"            "boxplot"              
 [37] "buffer"                "cartogram"             "catalyze"             
 [40] "categories"            "cats"                  "cbind2"               
 [43] "cellFromRowCol"        "cellFromRowColCombine" "cellFromXY"           
 [46] "cells"                 "cellSize"              "centroids"            
 [49] "clamp"                 "clamp_ts"              "classify"             
 [52] "clearance"             "click"                 "colFromCell"          
 [55] "colFromX"              "colorize"              "coltab"               
 [58] "coltab<-"              "combineGeoms"          "compare"              
 [61] "Compare"               "compareGeom"           "concats"              
 [64] "contour"               "convHull"              "costDist"             
 [67] "countNA"               "cover"                 "crds"                 
 [70] "crop"                  "crosstab"              "crs"                  
 [73] "crs<-"                 "datatype"              "deepcopy"             
 [76] "delaunay"              "densify"               "density"              
 [79] "depth"                 "depth<-"               "describe"             
 [82] "diff"                  "direction"             "disagg"               
 [85] "distance"              "dots"                  "draw"                 
 [88] "droplevels"            "elongate"              "emptyGeoms"           
 [91] "erase"                 "expanse"               "ext"                  
 [94] "ext<-"                 "extend"                "extract"              
 [97] "extractAlong"          "extractRange"          "fileBlocksize"        
[100] "fillHoles"             "fillTime"              "flip"                 
[103] "focal"                 "focal3D"               "focalCor"             
[106] "focalCpp"              "focalMat"              "focalPairs"           
[109] "focalReg"              "focalValues"           "forceCCW"             
[112] "free_RAM"              "freq"                  "gaps"                 
[115] "gdal"                  "gdalCache"             "geom"                 
[118] "geomtype"              "getGDALconfig"         "getTileExtents"       
[121] "global"                "graticule"             "gridDist"             
[124] "gridDistance"          "halo"                  "has.colors"           
[127] "has.RGB"               "has.time"              "hasMinMax"            
[130] "hasValues"             "head"                  "hist"                 
[133] "identical"             "ifel"                  "image"                
[136] "impose"                "inext"                 "init"                 
[139] "inMemory"              "inset"                 "interpIDW"            
[142] "interpNear"            "interpolate"           "intersect"            
[145] "is.bool"               "is.empty"              "is.factor"            
[148] "is.int"                "is.lines"              "is.lonlat"            
[151] "is.points"             "is.polygons"           "is.related"           
[154] "is.rotated"            "is.valid"              "isFALSE"              
[157] "isTRUE"                "k_means"               "lapp"                 
[160] "layerCor"              "levels"                "linearUnits"          
[163] "lines"                 "logic"                 "Logic"                
[166] "longnames"             "longnames<-"           "makeNodes"            
[169] "makeTiles"             "makeValid"             "makeVRT"              
[172] "map.pal"               "mask"                  "match"                
[175] "math"                  "Math"                  "Math2"                
[178] "mean"                  "median"                "mem_info"             
[181] "merge"                 "mergeLines"            "mergeTime"            
[184] "meta"                  "metags"                "metags<-"             
[187] "minCircle"             "minmax"                "minRect"              
[190] "modal"                 "mosaic"                "na.omit"              
[193] "NAflag"                "NAflag<-"              "names"                
[196] "ncell"                 "ncol"                  "ncol<-"               
[199] "nearby"                "nearest"               "nlyr"                 
[202] "nlyr<-"                "noNA"                  "normalize.longitude"  
[205] "north"                 "not.na"                "nrow"                 
[208] "nrow<-"                "nsrc"                  "origin"               
[211] "origin<-"              "pairs"                 "panel"                
[214] "patches"               "perim"                 "persp"                
[217] "plet"                  "plot"                  "plotRGB"              
[220] "points"                "polys"                 "prcomp"               
[223] "predict"               "princomp"              "project"              
[226] "quantile"              "query"                 "rangeFill"            
[229] "rapp"                  "rast"                  "rasterize"            
[232] "rasterizeGeom"         "rasterizeWin"          "rcl"                  
[235] "readRDS"               "readStart"             "readStop"             
[238] "readValues"            "rectify"               "regress"              
[241] "relate"                "removeDupNodes"        "res"                  
[244] "res<-"                 "resample"              "rescale"              
[247] "rev"                   "RGB"                   "RGB<-"                
[250] "roll"                  "rotate"                "round"                
[253] "rowColCombine"         "rowColFromCell"        "rowFromCell"          
[256] "rowFromY"              "same.crs"              "sapp"                 
[259] "saveRDS"               "sbar"                  "scale"                
[262] "scoff"                 "scoff<-"               "sds"                  
[265] "segregate"             "sel"                   "selectHighest"        
[268] "selectRange"           "serialize"             "set.cats"             
[271] "set.crs"               "set.ext"               "set.names"            
[274] "set.RGB"               "set.values"            "setGDALconfig"        
[277] "setMinMax"             "setValues"             "shade"                
[280] "sharedPaths"           "shift"                 "sieve"                
[283] "simplifyGeom"          "size"                  "snap"                 
[286] "sort"                  "sources"               "spatSample"           
[289] "spin"                  "split"                 "sprc"                 
[292] "stdev"                 "stretch"               "subset"               
[295] "subst"                 "summary"               "Summary"              
[298] "svc"                   "symdif"                "t"                    
[301] "tail"                  "tapp"                  "terrain"              
[304] "terraOptions"          "text"                  "tighten"              
[307] "time"                  "time<-"                "timeInfo"             
[310] "tmpFiles"              "trans"                 "trim"                 
[313] "union"                 "unique"                "units"                
[316] "units<-"               "unserialize"           "unwrap"               
[319] "update"                "values"                "values<-"             
[322] "varnames"              "varnames<-"            "vect"                 
[325] "vector_layers"         "viewshed"              "voronoi"              
[328] "vrt"                   "vrt_tiles"             "weighted.mean"        
[331] "where.max"             "where.min"             "which.lyr"            
[334] "which.max"             "which.min"             "width"                
[337] "window"                "window<-"              "wrap"                 
[340] "wrapCache"             "writeCDF"              "writeRaster"          
[343] "writeStart"            "writeStop"             "writeValues"          
[346] "writeVector"           "xapp"                  "xFromCell"            
[349] "xFromCol"              "xmax"                  "xmax<-"               
[352] "xmin"                  "xmin<-"                "xres"                 
[355] "xyFromCell"            "yFromCell"             "yFromRow"             
[358] "ymax"                  "ymax<-"                "ymin"                 
[361] "ymin<-"                "yres"                  "zonal"                
[364] "zoom"                 

Raster data

Create and define raster

r <- rast(ncol=10, nrow = 10)

r[] <- runif(n=ncell(r))

r
class       : SpatRaster 
dimensions  : 10, 10, 1  (nrow, ncol, nlyr)
resolution  : 36, 18  (x, y)
extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source(s)   : memory
name        :      lyr.1 
min value   : 0.01871994 
max value   : 0.99392609 

Raster data

plot(r)

Basic raster in R.

Raster data

We can access data from specific locations within a raster

# Access data from the ith location in a raster
r[12]
      lyr.1
1 0.5669319
# Access data based on row and column
r[2, 2] 
      lyr.1
1 0.5669319

Raster data

Rasters can be stacked, which makes them very efficient to work with

# Create 2 new rasters based on raster r
r2 <- r * 50
r3 <- sqrt(r * 5)

# Stack rasters and rename to be unique
s <- c(r, r2, r3)
names(s) <- c('r1', 'r2', 'r3')

Raster data

plot(s)

Working with real data

  • There are several packages for accessing geospatial data from the web
  • We will use the FedData package, but numerous other packages exist to access data within and outside the U.S.
  • One useful example is the elevatr package for accessing elevation data around the world
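As a hedged illustration (not run here), elevatr can pull a DEM for any sf object, such as the buffered Corvallis polygon created in the next example; the zoom level z controls resolution, and the exact return class may vary with the package version:

```r
library(elevatr)

# Sketch: download a DEM covering an sf polygon
# (higher z = finer resolution; ~9-12 is typical for local work)
dem <- get_elev_raster(locations = corvallis, z = 10)

# elevatr returns a raster-package object; convert to terra if desired
dem <- terra::rast(dem)
plot(dem)
```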

Working with real data

We will walk through an example of extracting information from a raster using a polygon layer. To do this we will:

  • Select just Corvallis among oregon_cities
  • Add a 10,000-m buffer
  • Download National Elevation Data
  • Transform the projection system of the elevation data to match Corvallis
  • Calculate the average elevation within 10km of Corvallis

Working with real data

  1. Buffer Corvallis
library(FedData)

# Select just Corvallis and calculate a 10,000-m buffer
corvallis <- 
  oregon_cities %>%
  filter(cities == 'Corvallis') %>% 
  st_buffer(10000)

Working with real data

  2. Download NED based on Corvallis buffer
# Download national elevation data (ned)
ned <- FedData::get_ned(
  template = corvallis,
  label = "corvallis")
  3. Transform the CRS
ned <- terra::project(ned, 
                      'epsg:5070',
                      method = 'bilinear')

Working with real data

  4. Mean elevation within polygon
# zonal function in terra to calculate zonal statistics
terra::zonal(ned, 
             
             # Need to convert corvallis `sf` object to terra vector
             terra::vect(corvallis), 
             
             # Metric to be calculated
             mean, na.rm = T)
   Layer_1
1 119.6678

Your Turn

  1. Read in U.S. cities with data('us.cities') from the maps library
  2. Select the city of your choice and buffer it by 10 km. (We suggest converting to an equal area projection first)
  3. Read in National Elevation Data for your city with the FedData package
  4. Transform CRS of elevation data to match city
  5. Calculate the mean elevation within 10km of your city

Solution

1-2. Buffer city of your choice

library(maps)
data('us.cities')

my_city <- us.cities %>% 
  filter(name == 'Idaho Falls ID') %>% 
  st_as_sf(coords = c('long', 'lat'), crs = 4269) %>% 
  st_transform(crs = 5070) %>% 
  st_buffer(10000)

Solution

  3. Read in elevation data
ned <- FedData::get_ned(
  template = my_city,
  label = "Idaho Falls")

Solution

  4. Transform CRS
ned <- terra::project(ned, 
                      'epsg:5070',
                      method = 'bilinear')

Solution

  5. Calculate mean elevation within buffer
terra::zonal(ned, 
             terra::vect(my_city), 
             mean, na.rm = TRUE)
  USGS_1_n44w112
1       1450.104

Watershed Delineation

  • Characterizing watersheds is fundamental to much of our work in freshwater science
  • Although it is more than we can cover in today’s workshop, we want you to be aware that there are several options for delineating watersheds in R
  • We’ll provide two examples of how to delineate watersheds within the conterminous U.S. using two online services

USGS StreamStats

The USGS’s StreamStats is an online service and map interface that allows users to navigate to a desired location and delineate a watershed boundary with the click of a mouse:

https://streamstats.usgs.gov/ss/

In addition to the map interface, the data are also accessible via an API:

https://streamstats.usgs.gov/docs/streamstatsservices

USGS StreamStats

It is also possible to replicate this functionality in R:

streamstats_ws = function(state, longitude, latitude){
  p1 = 'https://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode='
  p2 = '&xlocation='
  p3 = '&ylocation='
  p4 = '&crs=4269&includeparameters=false&includeflowtypes=false&includefeatures=true&simplify=true'
  query <-  paste0(p1, state, p2, toString(longitude), p3, toString(latitude), p4)
  mydata <- jsonlite::fromJSON(query, simplifyVector = FALSE, simplifyDataFrame = FALSE)
  poly_geojsonsting <- jsonlite::toJSON(mydata$featurecollection[[2]]$feature, auto_unbox = TRUE)
  poly <- geojsonio::geojson_sf(poly_geojsonsting) 
  poly
}

# Define location for delineation (Calapooia Watershed)
state <- 'OR'
latitude <- 44.62892
longitude <- -123.13113

# Delineate watershed
cal_ws <- streamstats_ws('OR', longitude, latitude) %>% 
  st_transform(crs = 5070)

USGS StreamStats

nhdplusTools

  • nhdplusTools is an R package that can access the Network Linked Data Index (NLDI) service, which provides navigation and extraction of NHDPlus data: https://doi-usgs.github.io/nhdplusTools/
  • nhdplusTools includes network navigation options as well as watershed delineation
  • The delineation method differs from StreamStats (based on National Hydrography IDs)

nhdplusTools

library(nhdplusTools)

# Simple feature option to generate point without any other attributes
cal_pt <- st_sfc(st_point(c(longitude, latitude)), crs = 4269)

# Identify the network location (NHDPlus common ID or COMID)
start_comid <- nhdplusTools::discover_nhdplus_id(cal_pt)

# Combine info into list (required by NLDI basin function)
ws_source <- list(featureSource = "comid", featureID = start_comid)

cal_ws2 <- nhdplusTools::get_nldi_basin(nldi_feature = ws_source)

nhdplusTools

nhdplusTools

nhdplusTools works by walking a network of pre-staged sub-catchments with unique IDs (COMIDs)
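The same COMID-based list used for basin delineation can also seed network navigation. A hedged sketch using nhdplusTools::navigate_nldi() with the ws_source list defined above (the returned element names may differ slightly across package versions):

```r
library(nhdplusTools)

# Sketch: walk the network upstream from the same starting COMID.
# "UT" = upstream with tributaries; distance_km caps the navigation
flowlines <- navigate_nldi(ws_source,
                           mode = "UT",
                           distance_km = 50)

# The result is a list holding the origin feature and the navigated
# flowlines, which can be mapped alongside the delineated basin
```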

Other watershed delineation methods

  • Outside of the U.S., these tools are not available
  • It is still possible to delineate a custom watershed from raw DEM data with help from the whitebox R package
  • This book walks through this process
  • Jeff Hollister (U.S. EPA) developed an R package called elevatr that can access DEM data from the web both within and outside the U.S.

Your Turn

  1. Delineate the Logan River watershed in Utah at -111.855, 41.707
  2. Use mapview::mapview to plot the watershed

Solution

  1. Delineate Logan River watershed
# Logan River Watershed
latitude <- 41.707
longitude <- -111.855

# Define the lat/lon
start_point <- st_sfc(st_point(c(longitude, latitude)), crs = 4269)
# Find COMID of this point
start_comid <- nhdplusTools::discover_nhdplus_id(start_point)
# Create a list object that defines the feature source and starting COMID
ws_source <- list(featureSource = "comid", featureID = start_comid)
# Delineate basin
logan_ws <- nhdplusTools::get_nldi_basin(nldi_feature = ws_source)

Solution

  1. Map the Logan River watershed
mapview::mapview(logan_ws)

StreamCatTools

Why StreamCat/LakeCat?

  • Examples covered are for single watersheds - but we often need data for hundreds or thousands
  • Overlaying watersheds onto numerous landscape rasters/layers is a complicated, lengthy process
  • StreamCat/LakeCat solves this issue by providing summarized data for all streams and lakes in the NHDPlus (medium resolution)

Why StreamCat/LakeCat?

Let’s revisit the Calapooia watershed and calculate percent row crop for 2019 with FedData

  • Download NLCD for 2019
nlcd <- get_nlcd(
  template = cal_ws2,
  year = 2019,
  label = 'Calapooia') %>%
  terra::project('epsg:5070', method = 'near') 

Why StreamCat/LakeCat?

  • Use terra to extract the watershed metrics
terra::extract(nlcd,
               terra::vect(cal_ws2)) %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  mutate(percent = (count / sum(count)) * 100) %>%
  filter(Class == 'Cultivated Crops') %>%
  print()
# A tibble: 1 × 3
  Class             count percent
  <fct>             <int>   <dbl>
1 Cultivated Crops 112485    10.5

Why StreamCat/LakeCat?

Now let’s extract the same information using the StreamCatTools package to access the online StreamCat database

library(StreamCatTools)

comid <- '23763521'

sc_get_data(comid = comid,
            metric = 'PctCrop2019', 
            aoi = 'watershed')
# A tibble: 1 × 3
     COMID WSAREASQKM PCTCROP2019WS
     <dbl>      <dbl>         <dbl>
1 23763521       965.          10.5

StreamCatTools

  • StreamCat and the StreamCatTools package provide access to the same information with much less code because StreamCat pre-stages watershed metrics for each NHDPlus catchment

StreamCatTools

StreamCat also provides metrics at several scales

Figure 9: Geospatial framework of the StreamCat Dataset

StreamCatTools

sc_get_data(comid = comid,
            metric = 'PctCrop2019', 
            aoi = 'catchment,watershed')
# A tibble: 1 × 5
     COMID WSAREASQKM CATAREASQKM PCTCROP2019CAT PCTCROP2019WS
     <dbl>      <dbl>       <dbl>          <dbl>         <dbl>
1 23763521       965.        8.49           36.6          10.5

StreamCatTools

For a subset of watershed metrics, summaries within ~100m of the stream segment are also available for catchments and watersheds:

Figure 10: Riparian buffers (red) of NHD stream lines (white) and on-network NLCD water pixels (blue).

StreamCatTools

sc_get_data(comid = comid,
            metric = 'PctCrop2019', 
            aoi = 'catchment,riparian_catchment,watershed,riparian_watershed') %>% 
  as.data.frame()
     COMID WSAREASQKM WSAREASQKMRP100 CATAREASQKM CATAREASQKMRP100
1 23763521   964.6101          136.17      8.4933           1.1097
  PCTCROP2019CATRP100 PCTCROP2019WSRP100 PCTCROP2019CAT PCTCROP2019WS
1                9.08                8.4          36.59         10.49

StreamCatTools

StreamCatTools also makes it very easy to grab data for entire regions.

library(ggplot2)

iowa_crop <- sc_get_data(state = 'IA',
                         metric = 'PctCrop2019', 
                         aoi = 'watershed')

StreamCatTools

ggplot() + 
  geom_histogram(data = iowa_crop,
                 aes(x = PCTCROP2019WS)) + 
  theme_bw()

StreamCatTools

Histogram of crop as a % of watersheds within Iowa in 2019.

StreamCatTools

  • We can provide multiple metrics to the request by separating them with a comma.
  • Note that the request is provided as a single string, 'PctCrop2001, PctCrop2019', rather than a vector of metrics: c('PctCrop2001', 'PctCrop2019').
  • The request itself is agnostic to formatting of the text. For example, these requests will also work: 'pctcrop2001, pctcrop2019' or 'PCTCROP2001,PCTCROP2019'.
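For example, pulling both crop years for the Calapooia COMID in one request:

```r
library(StreamCatTools)

# Two metrics supplied as a single comma-separated string
sc_get_data(comid = '23763521',
            metric = 'PctCrop2001, PctCrop2019',
            aoi = 'watershed')
# Returns one row per COMID with a column for each requested metric
# (e.g., PCTCROP2001WS and PCTCROP2019WS)
```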

StreamCatTools

StreamCat contains hundreds of metrics and we recommend consulting the metric list to identify those of interest for your study.
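The available metric names can also be pulled programmatically; a hedged sketch assuming StreamCatTools’ sc_get_params() helper:

```r
library(StreamCatTools)

# Sketch: list metric names available from the StreamCat API
metrics <- sc_get_params(param = 'name')
head(metrics)

# sc_fullname() (if present in your version) expands short metric
# names into their full descriptive titles
sc_fullname('PctCrop2019')
```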

Accessing LakeCat

The R function to access LakeCat data was designed to parallel StreamCat functions

Let’s walk through an example together that:

  • Defines an sf object of a sample point at Pelican Lake, WI
  • Obtains the lake waterbody polygon
  • Extracts the COMID (unique ID) to query LakeCat
  • Pulls data on mean watershed elevation, calcium oxide content of the geology, % sand and organic matter content of soils, and % of the watershed composed of deciduous forest

Accessing LakeCat

library(nhdplusTools)

# Pelican Lake, WI
latitude <- 45.502840
longitude <- -89.198694

pelican_pt <- data.frame(site_name = 'Pelican Lake',
                         latitude = latitude,
                         longitude = longitude) %>% 
  st_as_sf(coords = c('longitude', 'latitude'), crs = 4326)

pelican_lake <- nhdplusTools::get_waterbodies(pelican_pt) 

comid <- pelican_lake %>% 
  pull(comid)

lc_get_data(metric = 'elev, cao, sand, om, pctdecid2019',
            aoi = 'watershed',
            comid = comid)
# A tibble: 1 × 7
      COMID WSAREASQKM SANDWS ELEVWS CAOWS PCTDECID2019WS  OMWS
      <dbl>      <dbl>  <dbl>  <dbl> <dbl>          <dbl> <dbl>
1 167120863       41.0   59.6   491.  4.81           14.7  16.7

Your Turn

Query LakeCat to determine if the basins of these lakes had more deciduous or conifer forest in 2019:

"9028333,9025609,9025611,9025635"

The LakeCat metrics page may help


Solution

library(StreamCatTools)

comids <- "9028333,9025609,9025611,9025635"

lc_get_data(metric = 'pctdecid2019, pctconif2019',
            aoi = 'watershed',
            comid = comids) 
# A tibble: 4 × 4
    COMID WSAREASQKM PCTDECID2019WS PCTCONIF2019WS
    <dbl>      <dbl>          <dbl>          <dbl>
1 9025609      5.27            18.1          0    
2 9025611      0.447           19.3          0    
3 9025635     13.6             22.3          0.339
4 9028333      6.62            34.5          2.32 

Bringing it all together

  • To finish the workshop, we will walk through two examples of using spmodel to model phenomena in two types of freshwaters:
    • Electrical conductivity of lake water across Minnesota
    • The distribution of Argia damselfly across several northeastern states

Lake Conductivity

  • Conductivity is an important water quality measure and one of growing concern due to salinization of freshwater (Cañedo-Argüelles et al. 2016, Kaushal et al. 2021)
  • This analysis is based on a recent paper by Michael Dumelle and others published in the journal Spatial Statistics. The GitHub repository for this paper is also available

Data Prep: Conductivity (Dependent) Data

Load required packages…

library(tidyverse)
library(sf)
library(tigris)
library(StreamCatTools)
library(spmodel)
library(data.table)

Data Prep: Conductivity (Dependent) Data

Read and prep table of lake conductivity values…

# Read in states to give some context
states <- tigris::states(cb = TRUE, progress_bar = FALSE)  %>% 
  filter(!STUSPS %in% c('HI', 'PR', 'AK', 'MP', 'GU', 'AS', 'VI'))  %>% 
  st_transform(crs = 5070)

# Read in lakes, select/massage columns, convert to spatial object
lake_cond <- fread('nla_obs.csv') %>% 
  select(UNIQUE_ID, COMID, COND_RESULT,
         AREA_HA, DSGN_CYCLE,
         XCOORD, YCOORD) %>% 
  mutate(DSGN_CYCLE = factor(DSGN_CYCLE)) %>% 
  st_as_sf(coords = c('XCOORD', 'YCOORD'), 
           crs = "+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +type=crs") %>% 
  st_transform(crs = 5070)

Data Prep: Conductivity (Dependent) Data

ggplot() +
  geom_sf(data = states,
          fill = NA) +
  geom_sf(data = lake_cond,
          aes(color = DSGN_CYCLE)) +
  scale_color_manual(values=c("#a6cee3", "#1f78b4", "#b2df8a")) +
  theme_bw() +
  theme(legend.position="bottom") 

Data Prep: Conductivity (Dependent) Data

NLA lake samples across the U.S.

Data Prep: Conductivity (Dependent) Data

Select sample sites within Minnesota

MN <- states %>% 
  filter(STUSPS == 'MN')

cond_mn <- lake_cond %>% 
  st_filter(MN) %>% 
  rename(year = DSGN_CYCLE)

Data Prep: Conductivity (Dependent) Data

ggplot() +
  geom_sf(data = MN,
          fill = NA) +
  geom_sf(data = cond_mn,
          aes(color = log(COND_RESULT))) +
  scale_color_distiller(palette = 'YlOrRd', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom") 

Data Prep: Conductivity (Dependent) Data

Distribution of conductivity values.

Data Prep: LakeCat (Independent) Data

We will use watershed predictors similar to those of Dumelle et al. (2023).

Already included in the data table with the response variable are:

  • Lake Area (AREA_HA)
  • Sample year (DSGN_CYCLE)

Data Prep: LakeCat (Independent) Data

From LakeCat, we also will get the following predictor variables:

  • Local long-term air temperature
  • Long-term watershed precipitation
  • Sulfur content of underlying lithology

Data Prep: LakeCat (Independent) Data

comids <- cond_mn$COMID

mn_lakecat <- lc_get_data(comid = comids,
                          metric = 'Tmean8110, Precip8110, S') %>% 
  select(COMID, TMEAN8110CAT, PRECIP8110WS, SWS)

mn_lakecat
# A tibble: 115 × 4
     COMID TMEAN8110CAT PRECIP8110WS    SWS
     <dbl>        <dbl>        <dbl>  <dbl>
 1 1099282         7.69         825. 0.0365
 2 1101812         7.91         812. 0.0329
 3 1768378         3.51         696. 0.0985
 4 1775870         4.28         749. 0.272 
 5 1776124         4.43         778. 0.272 
 6 2032039         7.23         794. 0.0712
 7 2034063         7.25         845. 0.0712
 8 2262633         5.58         768. 0.0712
 9 2262729         5.62         786. 0.0329
10 2350690         6.71         735. 0.0712
# ℹ 105 more rows

Data Prep: LakeCat (Independent) Data

In addition to these static LakeCat data, we would also like to pull NLCD data from specific years, matched to the sample years, for the % of each watershed composed of crop area

To do this we will need to:

  • Grab LakeCat NLCD % crop data for years 2006, 2011, 2016
  • Clean and pivot the columns and add a year column
  • Add 1 to each year since available NLCD are 1 year behind field samples

Data Prep: LakeCat (Independent) Data

crop <- 
  
  # Grab LakeCat crop data
  lc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'pctcrop2006, pctcrop2011, pctcrop2016') %>% 
  
  # Remove watershed area from data
  select(-WSAREASQKM) %>% 
  
  # Pivot table to long to create "year" column
  pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTCROPWS') %>% 
  
  # Remove PCTCROP and WS to make "year" column
  mutate(year = as.integer(
    str_replace_all(tmpcol, 'PCTCROP|WS', ''))) %>% 
  
  # Add 1 to each year to match NLA years
  mutate(year = factor(year + 1)) %>% 
  
  # Remove the tmp column
  select(-tmpcol)

Data Prep: LakeCat (Independent) Data

Now, join the tables to make our model data

model_data <- cond_mn %>% 
  left_join(mn_lakecat, join_by(COMID)) %>% 
  left_join(crop, join_by(COMID, year))

Modeling lake conductivity

Define the formula

formula <- 
  log(COND_RESULT) ~ 
  AREA_HA + 
  year + 
  TMEAN8110CAT +
  PRECIP8110WS + 
  PCTCROPWS + 
  SWS

Modeling lake conductivity

Run a non-spatial and spatial model for comparison

cond_mod <- splm(formula = formula,
                 data = model_data,
                 spcov_type = 'none')

cond_spmod <- splm(formula = formula,
                   data = model_data,
                   spcov_type = 'exponential')

summary(cond_spmod)

Call:
splm(formula = formula, data = model_data, spcov_type = "exponential")

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87192 -0.19602  0.02576  0.29597  1.28615 

Coefficients (fixed):
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   8.486e+00  8.813e-01   9.630  < 2e-16 ***
AREA_HA       2.394e-05  5.510e-05   0.435   0.6639    
year2012     -1.270e-01  3.261e-02  -3.896 9.80e-05 ***
year2017     -9.871e-02  3.851e-02  -2.563   0.0104 *  
TMEAN8110CAT  5.375e-01  6.740e-02   7.976 1.55e-15 ***
PRECIP8110WS -8.406e-03  1.313e-03  -6.404 1.51e-10 ***
PCTCROPWS     4.317e-03  2.767e-03   1.560   0.1188    
SWS           1.110e+00  7.740e-01   1.435   0.1514    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared: 0.5024

Coefficients (exponential spatial covariance):
       de        ie     range 
2.685e-01 1.371e-02 2.167e+04 
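Here de is the spatially dependent (partial sill) variance, ie is the spatially independent (nugget) variance, and range controls how quickly spatial correlation decays with distance. A hedged sketch, assuming spmodel’s varcomp() helper, partitions the variability among these components:

```r
# Proportion of variability attributed to the fixed effects and to the
# spatially dependent (de) and independent (ie) error components
spmodel::varcomp(cond_spmod)
```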

Modeling lake conductivity

Model comparison

glances(cond_mod, cond_spmod)
# A tibble: 2 × 10
  model         n     p  npar value   AIC  AICc logLik deviance pseudo.r.squared
  <chr>     <int> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1 cond_spm…   162     8     3  161.  167.  167.  -80.5     154.            0.502
2 cond_mod    162     8     1  284.  286.  286. -142.      154             0.775
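
A common rule of thumb is to compare models by their difference from the lowest AIC. A quick sketch using rounded AIC values like those in the glances() output above (illustrative numbers, not exact):

```r
# Delta-AIC: difference from the best (lowest-AIC) model; illustrative values
aic <- c(spatial = 167, nonspatial = 286)
aic - min(aic)   # spatial = 0; the large gap strongly favors the spatial fit
```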

Modeling lake conductivity

Leave-one-out comparison

prd_mod <- spmodel::loocv(cond_mod, se.fit = TRUE, cv_predict = TRUE) 

prd_spmod <- spmodel::loocv(cond_spmod, se.fit = TRUE, cv_predict = TRUE)

rbind(prd_mod %>% pluck('stats'),
      prd_spmod %>% pluck('stats'))
# A tibble: 2 × 4
      bias  MSPE RMSPE  cor2
     <dbl> <dbl> <dbl> <dbl>
1 -0.00261 0.270 0.519 0.731
2 -0.00392 0.140 0.374 0.860
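
For reference, the columns in this table are straightforward to compute from observed values and their leave-one-out predictions. A minimal sketch with hypothetical toy vectors (the exact sign convention spmodel uses for bias may differ):

```r
# Leave-one-out summary statistics from toy observed/predicted vectors (hypothetical values)
obs <- c(5.1, 4.8, 5.6, 5.0, 4.4)
prd <- c(5.0, 4.9, 5.3, 5.2, 4.5)

bias  <- mean(prd - obs)        # average prediction error
mspe  <- mean((prd - obs)^2)    # mean squared prediction error
rmspe <- sqrt(mspe)             # root MSPE, in the response's units
cor2  <- cor(obs, prd)^2        # squared correlation of observed vs predicted
```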

Map predicted values and standard errors

Join back to model data for mapping

# Combine predictions with model data (spatial points)
model_data <-
  model_data %>% 
  mutate(prd_cond = prd_spmod %>% 
           pluck('cv_predict'), 
         se_fit = prd_spmod %>% 
           pluck('se.fit'))

Map predicted values and standard errors

ggplot() +
  geom_sf(data = MN,
          fill = NA) +
  geom_sf(data = model_data,
          aes(color = prd_cond)) +
  scale_color_distiller(palette = 'YlOrRd', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom") 

Map predicted values and standard errors

Predicted log(conductivity) values.

Map predicted values and standard errors

ggplot() +
  geom_sf(data = MN,
          fill = NA) +
  geom_sf(data = model_data,
          aes(color = se_fit)) +
  scale_color_distiller(palette = 'Reds', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom") 

Map predicted values and standard errors

Model standard errors.

Argia distribution

In this example, we will model the occurrence of the Argia damselfly in several northeastern states

Image from: inaturalist.org

finsyncR

To start this exercise, we’ll introduce a new R package called finsyncR. This package was developed by US EPA scientists and academic collaborators.

finsyncR

Data Prep: Biological (Dependent) Data

Load additional packages

library(finsyncR)
library(pROC)

Data Prep: Biological (Dependent) Data

Next, use finsyncR to get genus-level macroinvertebrate data from just EPA and rarefy to a 300-organism count

The code also converts the data to occurrence data (1 = detect, 0 = non-detect) and sets a seed to make it reproducible.

macros <- getInvertData(dataType = "occur",
                        taxonLevel = "Genus",
                        agency = "EPA",
                        lifestage = FALSE,
                        rarefy = TRUE,
                        rarefyCount = 300,
                        sharedTaxa = FALSE,
                        seed = 1,
                        boatableStreams = TRUE)

 finsyncR is running: Gathering, joining, and cleaning EPA raw data                    
 finsyncR is running: Rarefying EPA data                              
 finsyncR is running: Applying taxonomic fixes to EPA data                    
 finsyncR is running: Finalizing data for output                          
 finsyncR data synchronization complete                                      
print(dim(macros))
[1] 6174  856

Data Prep: Biological (Dependent) Data

Clean data:

  • Select columns of interest
  • Remove EPA’s Wadeable Streams Assessment (2001-2004)
  • Convert “CollectionDate” to a date (lubridate::date()) and convert presence/absence to a factor
  • Convert table to a sf object and transform to EPSG:5070

Data Prep: Biological (Dependent) Data

# Flexible code so we could model another taxon
genus <- 'Argia'

taxon <- macros %>%
  dplyr::select(SampleID, 
                ProjectLabel, 
                CollectionDate,  
                Latitude_dd,
                Longitude_dd,
                all_of(genus))  %>%
  #filter(ProjectLabel != 'WSA') %>% 
  mutate(CollectionDate = date(CollectionDate),
         presence = 
           as.factor(pull(., genus)))  %>% 
  st_as_sf(coords = c('Longitude_dd', 'Latitude_dd'), crs = 4269)  %>% 
  st_transform(crs = 5070)

Data Prep: Biological (Dependent) Data

ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = presence),
          size = 1.5,
          alpha = 0.65) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom") 

Data Prep: Biological (Dependent) Data

Occurrence of Argia.

Data Prep: Biological (Dependent) Data

ggplot() + 
  geom_sf(data = states, fill = NA) +
  geom_sf(data = taxon, 
          aes(color = ProjectLabel),
          size = 1.5,
          alpha = 0.75) + 
  scale_color_manual(values=c("#a6cee3", "#1f78b4", "#b2df8a", "#33a02c")) +
  theme_bw() +
  theme(legend.position="bottom") 

Data Prep: Biological (Dependent) Data

NRSA samples.

Data Prep: Biological (Dependent) Data

For today’s exercise, we’ll narrow the samples down to the northeastern region of the U.S.

  • Select states from the northeastern U.S.
  • Select NRSA sample sites that intersect with these states.
  • Filter to just the 2013/2014 and 2018/2019 sample periods.
  • Create a new column for sample year.
  • Select desired columns.

Data Prep: Biological (Dependent) Data

region <- states %>% 
  filter(STUSPS %in% c('VT', 'NH', 'ME', 'NY', 'RI',
                       'MA', 'CT', 'NJ', 'PA', 'DE'))

# Use region as spatial filter (sf::st_filter()) for taxon of interest
taxon_rg <- taxon %>% 
  st_filter(region) %>% 
  filter(ProjectLabel %in% c('NRSA1314', 'NRSA1819')) %>% 
  mutate(year = year(ymd(CollectionDate))) %>% 
  select(SampleID:CollectionDate, presence:year) 

taxon_rg %>% 
  pull(presence) %>% 
  table()
.
  0   1 
485  65 

Data Prep: Biological (Dependent) Data

ggplot() + 
  geom_sf(data = region, fill = NA) +
  geom_sf(data = taxon_rg, 
          aes(color = presence)) + 
  scale_color_manual(values=c("#d9d9d9", "#08519c")) +
  theme_bw() +
  theme(legend.position="bottom") 

Data Prep: Biological (Dependent) Data

Northeast sites.

Data Prep: Predictor (Independent) Data

  1. Obtain list of NHDPlus COMIDs that match sample sites from nhdplusTools
  • Use NLDI service via StreamCat to get the COMIDs
  • Create a vector of COMIDs by splitting the COMID string
  • Add COMID to our Argia occurrence table

Data Prep: Predictor (Independent) Data

comids <- sc_get_comid(taxon_rg)

#comids <- read_rds('./data/nrsa_comids.rds')
comid_vect <- 
  comids %>%
  str_split(',') %>%
  unlist() %>%
  as.integer()

taxon_rg <- 
  taxon_rg %>%
  mutate(COMID = comid_vect) 

Data Prep: Predictor (Independent) Data

  2. Get non-varying StreamCat data.
sc <- 
  sc_get_data(comid = comids,
              aoi = 'watershed',
              metric = 'bfi, precip8110, wetindex, elev',
              showAreaSqKm = TRUE)

Data Prep: Predictor (Independent) Data

  3. PRISM air temperatures for sample periods
  • The prism package requires that we set a download folder in our workspace. Here, we set it to “prism_data” inside of our “data” folder; the folder is created if it does not already exist.
  • We then stack the climate rasters and use terra::extract() to pull mean temperatures at the sample points.

Data Prep: Predictor (Independent) Data

Don’t worry - we’ll walk through this together

library(prism)

# Get these years of PRISM
years <- c(2013, 2014, 2018, 2019)

# Set the PRISM directory (creates the directory if not present)
prism_set_dl_dir("./data/prism_data", create = TRUE)

# Download monthly PRISM rasters (tmean)
get_prism_monthlys('tmean', 
                   years = years, 
                   mon = 7:8, 
                   keepZip = FALSE)

# Create stack of downloaded PRISM rasters
tmn <- pd_stack(prism_archive_subset("tmean", "monthly", 
                                     years = years, 
                                     mon = 7:8))

# Extract tmean at sample points and massage data
tmn <- terra::extract(tmn, 
                      # Transform taxon_rg to CRS of PRISM on the fly
                      taxon_rg %>% 
                        st_transform(crs = st_crs(tmn))) %>%
  
  # Add COMIDs to extracted values
  data.frame(COMID = comid_vect, .) %>%
  
  # Remove front and back text from PRISM year/month in names
  rename_with( ~ stringr::str_replace_all(., 'PRISM_tmean_stable_4kmM3_|_bil', '')) %>% 
  
  # Pivot to long table and call the column TMEANPRISMXXXXPT, where XXXX indicates year
  pivot_longer(!COMID, names_to = 'year_month', 
               values_to = 'TMEANPRISMXXXXPT') %>% 
  
  # Create new column of year
  mutate(year = year(ym(year_month))) %>% 
  
  # Average July and August temperatures 
  summarise(TMEANPRISMXXXXPT = mean(TMEANPRISMXXXXPT, na.rm = TRUE), 
            .by = c(COMID, year))

Data Prep: Predictor (Independent) Data

Combine Dependent and Independent Data

model_data <-
  taxon_rg %>%
  left_join(sc, join_by(COMID)) %>%
  left_join(tmn, join_by(COMID, year)) %>%
  drop_na()

Modeling occurrence of genus Argia

Model formulation

formula <-
  presence ~
  I(log10(WSAREASQKM)) +
  ELEVWS +
  WETINDEXWS +
  BFIWS +
  PRECIP8110WS +
  TMEANPRISMXXXXPT

Modeling occurrence of genus Argia

Run non-spatial and spatial models for comparison

bin_mod <- spglm(formula = formula,
                 data = model_data,
                 family = 'binomial',
                 spcov_type = 'none')

bin_spmod <- spglm(formula = formula,
                   data = model_data,
                   family = 'binomial',
                   spcov_type = 'exponential')

summary(bin_spmod)

Call:
spglm(formula = formula, family = "binomial", data = model_data, 
    spcov_type = "exponential")

Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-1.4200 -0.4453 -0.3030 -0.1717  2.6018 

Coefficients (fixed):
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)           9.603140   5.862970   1.638  0.10144    
I(log10(WSAREASQKM))  0.760753   0.190182   4.000 6.33e-05 ***
ELEVWS               -0.007621   0.002402  -3.172  0.00151 ** 
WETINDEXWS            0.001087   0.003466   0.314  0.75384    
BFIWS                -0.050563   0.039372  -1.284  0.19906    
PRECIP8110WS         -0.002916   0.002653  -1.099  0.27172    
TMEANPRISMXXXXPT     -0.300855   0.140418  -2.143  0.03215 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared: 0.09293

Coefficients (exponential spatial covariance):
       de        ie     range 
2.458e+00 1.210e-02 3.703e+04 

Coefficients (Dispersion for binomial family):
dispersion 
         1 
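
One handy way to read the binomial fixed effects above is as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of occurrence per one-unit increase in that predictor. A quick sketch using the temperature estimate from the summary:

```r
# Odds ratio for a 1-unit (1 degree C) increase in mean summer temperature
beta_tmean <- -0.300855   # TMEANPRISMXXXXPT estimate from the summary above
exp(beta_tmean)           # ~0.74, i.e. odds of occurrence drop ~26% per degree
```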

Model comparison/performance

glances(bin_mod, bin_spmod)
# A tibble: 2 × 10
  model         n     p  npar value   AIC  AICc logLik deviance pseudo.r.squared
  <chr>     <int> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1 bin_spmod   550     7     3 1391. 1397. 1397.  -696.     219.           0.0929
2 bin_mod     550     7     1 1403. 1405. 1405.  -701.     329.           0.136 

Model comparison/performance

# Function to convert from log odds to probability
to_prob <- function(x) exp(x)/(1+exp(x))

# loocv of non-spatial model
loocv_mod <- loocv(bin_mod, cv_predict = TRUE, se.fit = TRUE) 

# loocv of spatial model
loocv_spmod <- loocv(bin_spmod, cv_predict = TRUE, se.fit = TRUE)

pROC::auc(model_data$presence, loocv_mod$cv_predict)
Area under the curve: 0.7765
pROC::auc(model_data$presence, loocv_spmod$cv_predict)
Area under the curve: 0.9085
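
As an aside, the to_prob() helper defined above is the inverse-logit function; base R already provides it as plogis(), so either can be used to back-transform cv_predict:

```r
# to_prob() is the inverse logit; base R's plogis() computes the same thing
to_prob <- function(x) exp(x) / (1 + exp(x))

x <- c(-2, 0, 2)
all.equal(to_prob(x), plogis(x))  # TRUE
to_prob(0)                        # 0.5: log odds of 0 is a 50% probability
```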

Map predictions and standard errors

Extract values from loocv() object and join to model data for mapping

prob_spmod <- to_prob(loocv_spmod$cv_predict)
sefit_spmod <- loocv_spmod$se.fit

model_data <- model_data %>%
  mutate(prob_spmod = prob_spmod,
         sefit_spmod = sefit_spmod)

Map predictions and standard errors

ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = model_data,
          aes(color = prob_spmod)) +
  scale_color_viridis_b() +
  theme_bw() +
  theme(legend.position="bottom")

Map predictions and standard errors

Predicted probabilities.

Map predictions and standard errors

ggplot() +
  geom_sf(data = region, fill = NA) +
  geom_sf(data = model_data,
          aes(color = sefit_spmod)) +
  scale_color_distiller(palette = 'Reds', direction = 1) +
  theme_bw() +
  theme(legend.position="bottom")

Map predictions and standard errors

Argia model standard errors.

Acknowledgements

  • Thank you so much for attending!
  • Questions?